Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: generalized wildcard queries #590

Merged
merged 6 commits into from
Oct 16, 2024
Merged

feat: generalized wildcard queries #590

merged 6 commits into from
Oct 16, 2024

Conversation

Taepper
Copy link
Collaborator

@Taepper Taepper commented Sep 19, 2024

resolves #458

Summary

This PR resolves the generalized lineage refactor. Currently the system only worked with lineages following the pango lineage format as it was designed for SARS-CoV-2.

Now this is generalized and just receives an arbitrary lineage definition file which outlines the parent-child relations for the provided lineages.

Breaking Changes:

  • The preprocessing config fields has changed from pangoLineageDefinitionFilename to lineageDefinitionFilename
  • We take a yaml lineage definition file instead of a pango alias key, a script is provided scripts/alias2lineageDefinitions.py that transforms a list of lineages and an alias into the requested file format
  • We now validate on input and on query whether the provided lineage is in the defined lineages and throw errors accordingly

In the current version, we do not allow alias queries. This means that the user-provided query needs to e.g. contain a valid pango lineage for the respective instances. Searching for B.1.1.529.1 does not work and instead a search for BA.1 is required

PR Checklist

  • All necessary documentation has been adapted or there is an issue to do so.
  • The implemented feature is covered by an appropriate test.

Copy link
Contributor

github-actions bot commented Sep 19, 2024

This is a preview of the changelog of the next release. If this branch is not up-to-date with the current main branch, the changelog may not be accurate. Rebase your branch on the main branch to get the most accurate changelog.

Note that this might contain changes that are on main, but not yet released.

Changelog:

0.3.0 (2024-10-16)

⚠ BREAKING CHANGES

  • generalized wildcard queries (#458)

Features

Bug Fixes

  • correctly escape quotes in field names (7e7b448)
  • resolve aliases when inserting to or querying lineage indexes again (04fd1e0)
  • update script to also generate aliases (c19bef9)

@Taepper Taepper marked this pull request as draft September 19, 2024 15:51
@Taepper
Copy link
Collaborator Author

Taepper commented Sep 19, 2024

By the way, the majority of the line-count (12634 lines) is the new (more-verbose) lineage definition file: testBaseData/exampleDataset/lineage_definitions.yaml

@Taepper Taepper force-pushed the generalized-lineage branch 3 times, most recently from 886c2aa to 7e94a82 Compare September 23, 2024 05:44
@Taepper Taepper marked this pull request as ready for review September 23, 2024 05:44
Copy link
Contributor

@fengelniederhammer fengelniederhammer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message should reference the issue (ideally with resolves #... in the footer - then release please will automatically pick it up and mention it in the changelog)

Also this is a breaking change - the commit message should mention it.

Also I wonder whether it might be good to still have a metadata type lineage or similar? Would there be any advantages?

include/silo/common/bidirectional_map.h Show resolved Hide resolved
testBaseData/samFiles/database_config.yaml Outdated Show resolved Hide resolved
src/silo_api/api.cpp Outdated Show resolved Hide resolved
src/silo_api/api.cpp Outdated Show resolved Hide resolved
src/silo/config/util/config_repository.cpp Outdated Show resolved Hide resolved
src/silo/config/util/config_repository.cpp Outdated Show resolved Hide resolved
src/silo/config/util/config_repository.cpp Outdated Show resolved Hide resolved
src/silo/config/util/config_repository.cpp Outdated Show resolved Hide resolved
src/silo/database.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this quite complicated. Wouldn't it be easier to do it in plain Python?

  • You get better IDE support compared to when writing SQL
  • We would not need to DuckDB dependency in Python

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ChatGPT gave me this, which almost works:

import json
import argparse
import yaml
import shutil
import sys
import os
import tempfile

# Define the argument parser
parser = argparse.ArgumentParser(description="Process a JSON file and convert it to a DataFrame.")
parser.add_argument('alias_key', type=str, help='Path to the alias_key in JSON format')
parser.add_argument('lineage_file', type=str, help='Path to the input_file containing all lineages')
parser.add_argument('--preserve-tmp-dir', action='store_true', help='Preserve the temporary directory to keep the intermediate files')

# Parse the arguments
args = parser.parse_args()

# Load the JSON data from the file path provided as argument
with open(args.alias_key, 'r') as file:
    alias_key = json.load(file)

# Create a list to store the reformatted data
reformatted_data = []

# Loop through the JSON and format it as desired
for key, value in alias_key.items():
    if value and isinstance(value, str):
        reformatted_data.append({"name": key, "alias": value})
    else:
        reformatted_data.append({"name": key, "alias": key})

# Load the lineage data from the file path provided as argument
with open(args.lineage_file, 'r') as file:
    lineages = file.read().splitlines()

# Create a dictionary to store the alias mappings
alias_dict = {item['name']: item['alias'] for item in reformatted_data}

# Function to unalias a lineage
def unalias_lineage(lineage):
    parts = lineage.split('.')
    if parts[0] in alias_dict:
        parts[0] = alias_dict[parts[0]]
    return '.'.join(parts)

# Unalias all lineages
unaliased_lineages = {lineage: unalias_lineage(lineage) for lineage in lineages}

# Function to find the immediate parent lineage
def find_immediate_parent(lineage):
    if '.' in lineage:
        return lineage.rsplit('.', 1)[0]
    return None

# Create the lineage definitions
lineage_definitions = {}
for lineage, unaliased in unaliased_lineages.items():
    parent = find_immediate_parent(unaliased)
    if parent:
        parent_unaliased = unalias_lineage(parent)
        lineage_definitions[lineage] = {"parents": [parent_unaliased]}
    else:
        lineage_definitions[lineage] = {"parents": []}

# Transform the data into YAML
yaml.dump(lineage_definitions, sys.stdout, default_flow_style=False)

if args.preserve_tmp_dir:
    temp_dir = tempfile.mkdtemp()
    print(f"Temporary directory: {temp_dir}", file=sys.stderr)
else:
    temp_dir = tempfile.mkdtemp()
    shutil.rmtree(temp_dir)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the SQL based script is less error-prone but effectively equivalent. I really like that I can just look at the intermediate tables and see the precise output of every single step in the script. What would the finished alternative look like?

Copy link
Contributor

@fengelniederhammer fengelniederhammer Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that I need to look at SQL statements to understand the code. Probably I'm just a lot less used to SQL :D

The finished script would look quite similar I guess. It resolves an alias in a couple places where it should print the alias instead (A.1.2.3.4 instead of AA.4). I assume it's easy to fix.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a hybrid and moved the ugly SQL part into python :)

The remaining part would be much more error prone (implementing the first part already demonstrated that to me :D) and a lot more verbose in my opinion

@Taepper Taepper force-pushed the generalized-lineage branch 2 times, most recently from 406b12f to 561327d Compare October 4, 2024 15:36
@Taepper Taepper changed the title feat: generalized wildcard queries, first finished draft feat: generalized wildcard queries Oct 4, 2024
@Taepper Taepper force-pushed the generalized-lineage branch 8 times, most recently from 94f714f to 93733a8 Compare October 7, 2024 09:27
@Taepper Taepper force-pushed the generalized-lineage branch 3 times, most recently from ff6cb04 to c1cb660 Compare October 7, 2024 14:46
Copy link
Contributor

@fengelniederhammer fengelniederhammer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually looks good, but it's not quite ready to be merged yet.

Also please make sure that the changelog contains the issue.

src/silo/storage/reference_genomes.test.cpp Show resolved Hide resolved
src/silo/storage/lineage_index.cpp Outdated Show resolved Hide resolved
src/silo/database.cpp Outdated Show resolved Hide resolved
src/silo/config/util/config_repository.test.cpp Outdated Show resolved Hide resolved
src/silo/common/lineage_tree.cpp Show resolved Hide resolved
src/silo/common/lineage_tree.cpp Outdated Show resolved Hide resolved
src/silo/common/lineage_tree.cpp Show resolved Hide resolved
scripts/generate_new_lineage_definitions.bash Outdated Show resolved Hide resolved
scripts/generate_new_lineage_definitions.bash Outdated Show resolved Hide resolved
@Taepper Taepper force-pushed the generalized-lineage branch 2 times, most recently from f3b1267 to 422d2e2 Compare October 8, 2024 14:31
@Taepper Taepper force-pushed the generalized-lineage branch 5 times, most recently from 1e2220b to b9a24c7 Compare October 9, 2024 07:14
@fengelniederhammer
Copy link
Contributor

I created GenSpectrum/LAPIS#978 to update the documentation.

BREAKING CHANGES:

The preprocessing config field pangoLineageDefinitionFilename has been renamed to lineageDefinitionFilename.
We now accept a YAML lineage definition file instead of a Pango alias key. A script (scripts/alias2lineageDefinitions.py) is provided to transform a list of lineages and an alias into the required file format.
Input and query validation now checks whether the provided lineage exists in the defined lineages, and errors are thrown if validation fails.

resolves #458
@Taepper Taepper force-pushed the generalized-lineage branch 2 times, most recently from 5807476 to 0336600 Compare October 16, 2024 08:44
@Taepper Taepper merged commit 277eb68 into main Oct 16, 2024
7 checks passed
@Taepper Taepper deleted the generalized-lineage branch October 16, 2024 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Generalize wildcard lineage queries: Support more general hierarchies
2 participants